3 days ago i was looking through the gtdbtk manual and saw that
de_novo_wf was an option for analysis to create the trees,
from the description given:
knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")
i beleived this would be something i should do as it might produce
more accurate trees. sample 1Dt2d Enterobacter
cancerogenus had been placed by the classify_wf in the
previous gtdbtk analysis in the genus Pantoea, which lead me to
this search. after a bit of trial and error, i produced this
script
This ran as a slurm job on hawk (SCW) from rougly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that i experimented with were
- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10
I settled on these as being the “best”, however, it is entirely possible that they could be more optimised.
This analysis produced these files:
/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log
./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv
./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv
./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results
./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree
I then moved this gtdbtk.bac120.decorated.tree file into
Dendroscope for review, all 10 are on one tree, but 1Dt2d
is still being placed in the “wrong” genus. on review of its sister
accession on the ncbi database.
On the NCBI page for the sister accession, can be found a CheckM analysis that comes back with
completeness: 90%
contamination: 3.6%
Taxonomy check status: failed
Upon viewing the tree in Dendroscope, the joining node has the label
0.968. This I believe to be the probability the
relationship is correct. this implies they are the same species, and the
online sample is also identified as Enterobacter cancerogenus.
However, due to the checkm analysis i find it plausible that they both
have been misidentified and are in reality Pantoea species, i find this
the most parsimonious explanation. I will follow this up with a CheckM
analysis of my own on 1Dt2d
This was a “technical spike” or proof of concept for
de_novo_wf
📌 TODO: do another Checkm analysis on 1Dt2d to see if the values are similar to the online sample
i wanted to see if the outputs of checkm differed from checkm2 so ran that on hawk. I also began recreating the innital table for metadata about the bangor samples, in the spirit of automation, a less manual approach was chosen this time around.
using this
script i ran a slurm job on hawk under the lineage_wf
of CheckM for all 10 Bangor-made samples, this took just 4 minutes. I
also worked on exporting the data i want to tabulate off of hawk. The
past way i did this was by manually entering each file
and noting down the important characteristics. However, because there
are going to be more samples(and i wanted to be clever) i decided to use
a more automated process. This was done by identifying different
documents in the flye directories on hawk, specifically files called
“assembly_info.txt” which contain the same information, but are vastly
more exportable. These are stored here: cd
/scratch/scw2160/02_outputs/flye_asm/flye_asm_[accession]/ using this
script i exported them off of hawk.
the CheckM analysis produced this output directory. My export script exported all 10 “assembly_info.txt” files to a directory in my home directory, as well as adding their accession ID to the name, this is important as otherwise i wouldnt know which belonged to what accession. I then brought them down and stored them here.
with it being christmas i did not take a serious look at the significance of the outputs of either, so that is what i plan to do next so that i can have some conclusions, maybe by the end of tomorrow. In conclusion, this process is only half done and will continue into the following entry(s).
📌 TODO: Complete analysis / processing of checkm and table stuff
Basic stuff, finding out what i can do to the checkm analysis to turn it into publishable data, making the table using R scripts and the output files. exams in January are rapidly approaching, so i really do need to pick up the pace if i want to get this done before i begin revising. there is not a real goal for this exact piece, i just want to know how congruent the outputs from CheckM are to CheckM2.
Using the CheckM
documentation. I found the output that i needed was bin_stats_ext.tsv.
this is where the contamination(and a lot of other fun looking stats)
are held. I can thusly import this file to R to try and extract usefull
stuff / tabulate. Over the course of the day i developed and improved CheckMtablegenerator.R.
which creates the table layed out in results below. This uses a few
packages, one of which was new to me, gt. This is a very
useful package for table generation, here is some info.
| Table 1 - subset of data obtained from CheckM analysis | ||||||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality. | ||||||||||
| Accession | Marker lineage | Completeness | Contamination | GC | GC std | Genome size | contigs | Longest contig | Coding density | Predicted genes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1Dt1h | o__Actinomycetales | 100.00 | 0.00 | 65.519% | 4.284% | 4,336,578 | 4 | 2,051,556 | 90.334% | 3,818 |
| 1Dt2d | f__Enterobacteriaceae | 100.00 | 0.90 | 53.889% | 1.270% | 5,345,294 | 6 | 2,288,975 | 87.931% | 4,937 |
| 1Dt100h | o__Actinomycetales | 98.84 | 0.19 | 52.446% | 0.564% | 2,379,473 | 2 | 2,349,045 | 89.586% | 2,071 |
| 2Dt1l | o__Actinomycetales | 98.82 | 2.32 | 67.742% | 0.844% | 3,996,506 | 16 | 1,089,026 | 93.707% | 3,915 |
| 3Dt1c | o__Sphingomonadales | 97.23 | 1.19 | 66.487% | 1.385% | 3,636,266 | 86 | 275,317 | 90.660% | 3,433 |
| 3Dt2j | o__Sphingomonadales | 96.19 | 1.21 | 65.644% | 1.752% | 2,813,977 | 75 | 195,248 | 92.779% | 2,786 |
| 3Dt2c | o__Sphingomonadales | 95.75 | 0.85 | 66.093% | 1.604% | 3,819,610 | 84 | 247,364 | 89.769% | 3,516 |
| 3Dt2h | o__Actinomycetales | 87.98 | 0.51 | 70.327% | 1.317% | 3,006,227 | 166 | 125,244 | 90.324% | 2,978 |
| 1Dt100g | o__Sphingomonadales | 76.77 | 0.00 | 68.105% | 1.332% | 2,104,071 | 153 | 74,994 | 92.645% | 2,179 |
| 2Dt1e | o__Actinomycetales | 44.74 | 0.00 | 71.162% | 1.565% | 2,332,403 | 149 | 70,689 | 89.402% | 2,215 |
| Source: CheckM lineage_wf performed on 2024-12-25 | ||||||||||
As mentioned, this table is a subset of the data, there was more outputted by CheckM, some of which overlaps with other analyses, but i did not know if it was relevent to include, so left it out, however, creating a new one with the omitted fields in would not be hard. This table will pair with one produced from CheckM2, because from what i’ve seen one is kind of an improvement on the other, but they kind of do different things, its confusing and thus why i want to do the comparison for future reference. I did not know what counted as “significantly complete”, so arbitrarily placed the value at 90%. This analysis shows that 7 of the 10 samples are complete or near enough, with only 3 falling below my line. It should be noted that 2Dt1e, the lowest sample is only found to be 45% complete, this is a large outlier from the rest of the range. Contamination is fairly low in all, none above 3% and 3 samples on 0.00%.
The main purpose of doing this is to compare with Checkm2, so im not sure what conclusions i can get from just this, other than many of the samples look sufficiently complete and i learned a fair amount from this experience. I think this will act as a good first step to compare future analysis against, for confirmation, as many upcoming analyses will give repeat data in some areas. Some of the field, for example “GC” are odd and i dont quite understand their significance, perhaps Aaron can give more info when he gets to this part of the notebook.
📌 TODO: finish the straight flye analysis / start on CheckM2
Today i wanted to finish off the work from yesterday and start on CheckM2. this should produce two tables that i can use to compare with the first one from Checkm, maybe find out which method is superior?
The order i did the two main focuses in shifted greatly during the day, with me bouncing between flye and CheckM2 analyses, so this section will have subheaders for better readability
I created an R script, “rawflyeanalysis.R”. Basic stuff, imports the text files i chose to download(i wonder if i could download the flye.log files and use a sub() to cut out the non-useful stuff). This then sorts and processes the data into the table seen below using a few packages, Table 2: Assembly_info.txt.
Just before lunch(11:45-ish) i set off the slurm job to run CheckM2 on hawk. Let’s see how that goes after lunch, one thing of note though, i did make a minor modification, in the original file, i was running it on the fastas one at a time(?) weird, anyway, i changed it to work on the directory containing all 10, so it should do them all at once. run_checkm2TN.sh.
After lunch, firstly i went back to the flye.log analysis i wanted to do largely for posterity. This was horrifyingly complex due to the “complex” nature of the flye.log files. I did this by creating a modified copy of my assembly_info.txt file grabber shell script, called flye.loggetter.sh, using ctrl + h to find and replace instead of manually changing things. I added this to an existing output directory in my home section of hawk, which i exported using MobaXterm(do i need to include a section on how i use MobaXterm?) to my computer. This is the folder called flyestuff, which contains both the flye.log and the assembly_info.txt files for each accession, i grouped them as it seemed relevent for organisation and didnt hinder the coding process at all. The R file i used to tabulate this is flyeloganalysis.R. This produced Table 3: Flye.log
This started with doing
cd /scratch/scw2160/02_outputs/flye_asm/CheckM2_20241228 to
see the outputs, the job took roughly half an hour, ending at 12:15 pm.
Firstly, i copied the output directory to my home area, then once again
used MobaXterm to bring it down to the git repo, here.
Then, i began work on the R file to turn the
quality_report.tsv file. This file was formatted very well,
so turning it into a table was very simple, done in CheckM2tablegenerator.R.
This produced Table 4: CheckM2
Note: there were other outputs to CheckM2, they were very interesting, im going to make the next section about exploring that possiblity, but they dont relate to this analysis.
| Table 2 - Data obtained from flye outputs | ||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||
| Accession | Number of contigs | Total length | Largest fragment | Mean fragment length |
|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 125,244 | 18,109.80 |
| 1Dt100g | 153 | 2,104,071 | 74,994 | 13,752.10 |
| 2Dt1e | 149 | 2,332,403 | 70,689 | 15,653.71 |
| 3Dt1c | 86 | 3,636,266 | 275,317 | 42,282.16 |
| 3Dt2c | 84 | 3,819,610 | 247,364 | 45,471.55 |
| 3Dt2j | 75 | 2,813,977 | 195,248 | 37,519.69 |
| 2Dt1l | 16 | 3,996,506 | 1,089,026 | 249,781.62 |
| 1Dt2d | 6 | 5,345,294 | 2,288,975 | 890,882.33 |
| 1Dt1h | 4 | 4,336,578 | 2,051,556 | 1,084,144.50 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 1,189,736.50 |
| Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024 | ||||
| coverage could not be calculated from this method as no result matched what was found in the flye.log files | ||||
The basic flye output analysis produced this table, as mentioned, i couldn’t get coverage to match what was on the flye.log file that i got the original of this table from, it is possible that file is the one that is wrong, but genome coverage is kind of all over the place, so i just omitted it as i don’t think its one of the mandatory fields. I chose to sort the table by “Number of contigs” rather arbitrarily, but it does show something interesting in that why is there such a large range in fragment number?
| Table 3 - Data obtained from flye.log outputs | ||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||||
| Accession | Fragments | Total length | Fragments N50 | Largest fragment | Scaffolds | Mean coverage |
|---|---|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 26,385 | 125,244 | 0 | 390 |
| 1Dt100g | 153 | 2,104,071 | 18,022 | 74,994 | 0 | 318 |
| 2Dt1e | 149 | 2,332,403 | 22,011 | 70,689 | 0 | 344 |
| 3Dt1c | 86 | 3,636,266 | 63,365 | 275,317 | 0 | 320 |
| 3Dt2c | 84 | 3,819,610 | 69,733 | 247,364 | 0 | 190 |
| 3Dt2j | 75 | 2,813,977 | 71,970 | 195,248 | 0 | 350 |
| 2Dt1l | 16 | 3,996,506 | 345,773 | 1,089,026 | 0 | 289 |
| 1Dt2d | 6 | 5,345,294 | 1,019,507 | 2,288,975 | 0 | 225 |
| 1Dt1h | 4 | 4,336,578 | 1,857,409 | 2,051,556 | 0 | 404 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 2,349,045 | 0 | 224 |
| Source: flye.log files produced in flye_asm analysis, August 27th 2024 | ||||||
This is a table of the type of stuff i first collated in the Summer, it shares many similarities and cross overs with Table 2, neither have quality statistics, but both are metadata largely concerning the length and fragment, a marked difference is this one comes with a concrete value of coverage. Comparative analysis of them feels like a thing that would happen in the conclusion to this document.
| Table 4 - quality analysis from CheckM2 | |||||
| based on samples taken from the skin microbiome of Dendrobates tinctorius | |||||
| Accession | Completeness | Contamination | Completeness model used | Translation table used | Additional notes |
|---|---|---|---|---|---|
| 1Dt1h | 100.00 | 0.31 | Neural Network (Specific Model) | 11 | None |
| 1Dt2d | 100.00 | 0.87 | Neural Network (Specific Model) | 11 | None |
| 2Dt1l | 100.00 | 0.27 | Neural Network (Specific Model) | 11 | None |
| 3Dt2c | 99.94 | 0.39 | Neural Network (Specific Model) | 11 | None |
| 3Dt2j | 99.80 | 0.00 | Neural Network (Specific Model) | 11 | None |
| 1Dt100h | 99.12 | 0.00 | Gradient Boost (General Model) | 11 | None |
| 3Dt1c | 98.34 | 0.86 | Neural Network (Specific Model) | 11 | None |
| 3Dt2h | 86.04 | 0.10 | Neural Network (Specific Model) | 11 | None |
| 1Dt100g | 78.42 | 0.33 | Neural Network (Specific Model) | 11 | None |
| 2Dt1e | 48.29 | 0.08 | Neural Network (Specific Model) | 11 | None |
| Source: CheckM2 predict performed on 2024-12-28 | |||||
This is the table of CheckM2 analysis, there are many differences in which fields are produced, also, there appears to be differences in the values themselves for Completeness and Contamination, perhaps part of the comparitive analysis, done in the conclusion should be the date at which both were at, as older versions will probably have less accurate values, and this may cause the difference as genomics is a fast evolving field. A Noteable difference is that CheckM used an internal diamond database, whereas CheckM2 required me to download my own. Exact comparison will be conducted in the conclusion.
Thus far, i think that i got all of the fields in Table 2 in a more reliable format in Table 1, but with the quality checks as well, both tables suffer weirdness with “cov.” and “GC” are they the same? are they both coverage? it is as mysterious as it is annoying. Let’s see how they match up to CheckM2. It is getting late, i think that tomorrow is going to be the comparitive analysis and then moving onto the fun-looking checkm2 stuff.
📌 ?: CONCLUSION: how do comparison of all 4 tables next to each other without creating a monstrocity of indented code / something about pills/ find date difference in version history between CheckM and CheckM2 comparitive analysis/
Due to this section being largely about the conclusion from the last week, the other sections may be light, as this is largely just a wrap up i was too tired to do yesterday.
One thing that might be relevent to note is that i changed the Markdown file so that the tables generated above now come from .csv output files. This helps on cutting down the time taken to knit the document and means that there are not multiple instances of code across the document, as the results section down here will include all 4 tables in one form or another. (basically, this increases flexibility of how the data can be used). In order to find the version history for CheckM, i went here ______. For CheckM2, i went here ______ and looked at the comit history as the version on Hawk is a developer version, not one of the official release, beginning at ____.
| Table 5 - Comparison of quality metrics | ||||
| for the same fasta files between CheckM/1.1.3 and CheckM2/0.1.3 | ||||
| Accession |
CheckM/1.1.3
|
CheckM2/0.1.3
|
||
|---|---|---|---|---|
| Completeness | Contamination | Completeness | Contamination | |
| 1Dt100g | 76.77 | 0.00 | 78.42 | 0.33 |
| 1Dt100h | 98.84 | 0.19 | 99.12 | 0.00 |
| 1Dt1h | 100.00 | 0.00 | 100.00 | 0.31 |
| 1Dt2d | 100.00 | 0.90 | 100.00 | 0.87 |
| 2Dt1e | 44.74 | 0.00 | 48.29 | 0.08 |
| 2Dt1l | 98.82 | 2.32 | 100.00 | 0.27 |
| 3Dt1c | 97.23 | 1.19 | 98.34 | 0.86 |
| 3Dt2c | 95.75 | 0.85 | 99.94 | 0.39 |
| 3Dt2h | 87.98 | 0.51 | 86.04 | 0.10 |
| 3Dt2j | 96.19 | 1.21 | 99.80 | 0.00 |
| Source: CheckM lineage_wf performed on 2024-12-25 CheckM2 predict performed on 2024-12-28 |
||||
📌 ?: AFTER LUNCH: date the programs, do the comparison for the “raw”/ no package analyses. / look in log files for diamond version for M and M2
| Table 2 - Data obtained from flye outputs | ||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||
| Accession | Number of contigs | Total length | Largest fragment | Mean fragment length |
|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 125,244 | 18,109.80 |
| 1Dt100g | 153 | 2,104,071 | 74,994 | 13,752.10 |
| 2Dt1e | 149 | 2,332,403 | 70,689 | 15,653.71 |
| 3Dt1c | 86 | 3,636,266 | 275,317 | 42,282.16 |
| 3Dt2c | 84 | 3,819,610 | 247,364 | 45,471.55 |
| 3Dt2j | 75 | 2,813,977 | 195,248 | 37,519.69 |
| 2Dt1l | 16 | 3,996,506 | 1,089,026 | 249,781.62 |
| 1Dt2d | 6 | 5,345,294 | 2,288,975 | 890,882.33 |
| 1Dt1h | 4 | 4,336,578 | 2,051,556 | 1,084,144.50 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 1,189,736.50 |
| Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024 | ||||
| coverage could not be calculated from this method as no result matched what was found in the flye.log files | ||||
| Table 3 - Data obtained from flye.log outputs | ||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius | ||||||
| Accession | Fragments | Total length | Fragments N50 | Largest fragment | Scaffolds | Mean coverage |
|---|---|---|---|---|---|---|
| 3Dt2h | 166 | 3,006,227 | 26,385 | 125,244 | 0 | 390 |
| 1Dt100g | 153 | 2,104,071 | 18,022 | 74,994 | 0 | 318 |
| 2Dt1e | 149 | 2,332,403 | 22,011 | 70,689 | 0 | 344 |
| 3Dt1c | 86 | 3,636,266 | 63,365 | 275,317 | 0 | 320 |
| 3Dt2c | 84 | 3,819,610 | 69,733 | 247,364 | 0 | 190 |
| 3Dt2j | 75 | 2,813,977 | 71,970 | 195,248 | 0 | 350 |
| 2Dt1l | 16 | 3,996,506 | 345,773 | 1,089,026 | 0 | 289 |
| 1Dt2d | 6 | 5,345,294 | 1,019,507 | 2,288,975 | 0 | 225 |
| 1Dt1h | 4 | 4,336,578 | 1,857,409 | 2,051,556 | 0 | 404 |
| 1Dt100h | 2 | 2,379,473 | 2,349,045 | 2,349,045 | 0 | 224 |
| Source: flye.log files produced in flye_asm analysis, August 27th 2024 | ||||||
| Table 1 - subset of data obtained from CheckM analysis | ||||||||||
| of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality. | ||||||||||
| Accession | Marker lineage | Completeness | Contamination | GC | GC std | Genome size | contigs | Longest contig | Coding density | Predicted genes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1Dt1h | o__Actinomycetales | 100.00 | 0.00 | 65.519% | 4.284% | 4,336,578 | 4 | 2,051,556 | 90.334% | 3,818 |
| 1Dt2d | f__Enterobacteriaceae | 100.00 | 0.90 | 53.889% | 1.270% | 5,345,294 | 6 | 2,288,975 | 87.931% | 4,937 |
| 1Dt100h | o__Actinomycetales | 98.84 | 0.19 | 52.446% | 0.564% | 2,379,473 | 2 | 2,349,045 | 89.586% | 2,071 |
| 2Dt1l | o__Actinomycetales | 98.82 | 2.32 | 67.742% | 0.844% | 3,996,506 | 16 | 1,089,026 | 93.707% | 3,915 |
| 3Dt1c | o__Sphingomonadales | 97.23 | 1.19 | 66.487% | 1.385% | 3,636,266 | 86 | 275,317 | 90.660% | 3,433 |
| 3Dt2j | o__Sphingomonadales | 96.19 | 1.21 | 65.644% | 1.752% | 2,813,977 | 75 | 195,248 | 92.779% | 2,786 |
| 3Dt2c | o__Sphingomonadales | 95.75 | 0.85 | 66.093% | 1.604% | 3,819,610 | 84 | 247,364 | 89.769% | 3,516 |
| 3Dt2h | o__Actinomycetales | 87.98 | 0.51 | 70.327% | 1.317% | 3,006,227 | 166 | 125,244 | 90.324% | 2,978 |
| 1Dt100g | o__Sphingomonadales | 76.77 | 0.00 | 68.105% | 1.332% | 2,104,071 | 153 | 74,994 | 92.645% | 2,179 |
| 2Dt1e | o__Actinomycetales | 44.74 | 0.00 | 71.162% | 1.565% | 2,332,403 | 149 | 70,689 | 89.402% | 2,215 |
| Source: CheckM lineage_wf performed on 2024-12-25 | ||||||||||
| Table 4 - quality analysis from CheckM2 | |||||
| based on samples taken from the skin microbiome of Dendrobates tinctorius | |||||
| Accession | Completeness | Contamination | Completeness model used | Translation table used | Additional notes |
|---|---|---|---|---|---|
| 1Dt1h | 100.00 | 0.31 | Neural Network (Specific Model) | 11 | None |
| 1Dt2d | 100.00 | 0.87 | Neural Network (Specific Model) | 11 | None |
| 2Dt1l | 100.00 | 0.27 | Neural Network (Specific Model) | 11 | None |
| 3Dt2c | 99.94 | 0.39 | Neural Network (Specific Model) | 11 | None |
| 3Dt2j | 99.80 | 0.00 | Neural Network (Specific Model) | 11 | None |
| 1Dt100h | 99.12 | 0.00 | Gradient Boost (General Model) | 11 | None |
| 3Dt1c | 98.34 | 0.86 | Neural Network (Specific Model) | 11 | None |
| 3Dt2h | 86.04 | 0.10 | Neural Network (Specific Model) | 11 | None |
| 1Dt100g | 78.42 | 0.33 | Neural Network (Specific Model) | 11 | None |
| 2Dt1e | 48.29 | 0.08 | Neural Network (Specific Model) | 11 | None |
| Source: CheckM2 predict performed on 2024-12-28 | |||||
Table 5 is a direct comparison of each module’s ability to calculate Completeness and Contamination for the same samples. Each column has a data colourisation, the scale for Completeness was 90-100%, so anything appearing in grey i consider “low quality”, the scale for Contamination is 0-3%. Both analyses agree that 3 of the ten accessions (1Dt100g, 2Dt1e and 3Dt2h) do not meet the 90% threshold, and only slightly differ on exact value. 2Dt1e is of special note because it is the only sample below 50% completeness. All values of Completeness are roughly the same, with CheckM2 having slightly higher results across the board. This contrasts with the values of Contamination, in which values differ quite dramatically, with small results getting larger and large results getting smaller. This is seen most in the results for 2Dt1l, in which the CheckM contamination states 2.32%, but this falls sharply in CheckM2 to 0.27%, coincidentally this sample also sees a rise in Completeness from 98.82% to 100%, so perhaps different genes or sequences are differently classified between the two systems. Overall, Contamination is lower in CheckM2 than CheckM.
However, Completeness and Contamination are not the only fields that these modules produce. CheckM returns with a wide array of extra data about the samples (as well as the extra stuff i will discuss in the next section, if that path leads anywhere) including; taxonomy to the order level, accession 1Dt2d even got identified to the family level, as well as many metrics for fragment count and size. It is important to remember that as an undergraduate student i don’t have a full understanding of all of these fields, and this is a subset of the data based upon what i thought was important. This contrasts with CheckM2 which outputs no additional data, however, it does include information about the “Completeness model” and “Translation table” used, which could be useful in fact-checking the results of the analysis.
A secondary analysis i wanted to perform was between the results of the CheckM analysis against that of the manual “assembly_info.txt” and “flye.log” analyses, as they all share overlaps and this could inform future reasearch as to which method is best so all do not need to be performed. All three agree on the number of contigs or fragments (which are either pseudonyms or used interchangeably), as well as the length of the genome and “largest fragment”/ “longest contig”. This is where differences arise in what addition fields each method reveals. The assembly_info.txt files give the mean fragment length, though i am not sure how relevant this is. The flye.log files contain information on “N50”, scaffolds and Mean coverage, i am not sure of the purpose of N50, scaffolds is only ever 0 for all, however mean coverage is perhaps a useful statistic only found in this analysis. Finally, CheckM contains information on “GC”, which i first took to mean Genome Coverage, but upon viewing this page, i discovered meant the proportion of Guanine and Cytosine to Adenine and Thymine, thus GC. This page also contains useful definitions for other fields i did not include, and even some for CheckM2. Other extra fields include coding density and predicted genes, potentially useful.
For the CheckM vs CheckM2 analysis, the version used for CheckM is older, version 1.1.3 was released on the 9th of July 2020, whereas version 0.1.3 of CheckM2 was created on the 18th of July 2022. However, it should be noted both versions found on hawk are out of date from the modern, with the latest version of CheckM being 1.2.3, released on the 25th of June 2024 and the latest version of CheckM2 is version 1.0.2, released on the 19th of May 2023. It would appear that CheckM2 is not a replacement for CheckM, as CheckM is more up to date than CheckM2. Both used their own internal DIAMOND database, dating these is very difficult, so the age of the programs themselves is the only thing i have to go on.
From this information and my analysis of the tables i generated, i conclude that CheckM is the most useful analysis, returning the largest amount of information about a sample and being the most up-to-date. It is difficult to determine which of CheckM and CheckM2 give more reliable results for completeness and contamination, however, if the impact of the analysis had real world implications, i would run them both in parallel due to CheckM2’s more focused nature, this should not add much time to the generation process as both are reletively quick and produce outputs easy to tabulate and analysise. I can confirm that CheckM is a good alternative to manually turning files in the Flye_asm output directories into tables, those ones entailed much difficulty to tabulate and contain less data. However, if there is a situation where coverage is mandatory, the flye.log file is the only one to give a reliable value.
In conclusion, i believe CheckM is the best overall analysis, giving results for all 3 extra analysis combined, in reference to CheckM2 they largely agree in all areas, but if my results had real world consequences i would run both for the conformation, though i can find no way of discerning which method is more accurate. Perhaps in future i can ask for the most modern versions of each to be downloaded to hawk and conduct a comparison on them then. My next projects will be conducted on areas around CheckM due to the discovery of graphical outputs and links to KO pathways, which i hope to yield interesting results. When term restarts i will have to discuss with Aaron the implications for this part of the study. i.e. what does the results of Completeness and contamination mean for our samples? or are these just things we want to know about them?
📌 TODO: (after conc of last bit) investigations into KO and protein stuff / just found out checkm is better documented and has some graph output options, so that is something to look into as well / tree_qa command (fun checkM stuff abound). / dendrogram?
In the CheckM documentation for lineage_wf, there are 6 functions, 5 are categorised as M (Mandetory), however a 6th “checkm tree_qa” is marked as “R” and is stated to be optional, looking closer at this function, it has 5 different potential outputs that look interesting, so i figured i would give it a go and see what i could get out of it, sort of exploratory.
The vast majority of the code required was already used for the original checkm bash script, so i copied it and made the modifications, producing this file. I could not figure out a way to run all 5 functions at once, so i had to go into the file and change the name (because these were all outputting to one place so naming the function was required to distinguish)and -o function(how to select the desired output) for each iteration. This was not an issue however as when run as a slurm job on hawk, none of the options took longer than 10 seconds to complete. In order to turn the output from option 2 into a table, i created this R script, which outputs this file. I am not sure if i mentioned it above, but i found a way to do the formatting for my tables in this document, which frees up memory and reduces load times, by loading in a .csv of the cleaned data, instead of cleaning the data and doing the whole script in R, which also creates multiple instances. Basically, it was a whole thing, but now it is not a problem because of some clever coding.
I created an output directory to store all the outputs. As mentioned, there were five; 2 are metadata tables which contain taxonomy data, one basic and one with more data. The next two are .tree files which place the samples in a tree with online references from the IMG database, one had the ID on that system, in the other the tip labels contained the full taxonomy of each sample. It is possible i can use these for a comparative analysis of CheckM and gtdbtk or ncbi and img. The fifth and final output is: “multiple sequence alignment of reference genomes and bins which can be used to infer a de novo genome tree”. It looks to be a fasta file for proteins in each sequence, as of now im unsure what to do with it, i recognise de novo from gtdbtk, is there a link? i will have to consult Aaron for this one.
section finished on 2024-12-31
| Table 6 - subset of data obtained from secondary CheckM analysis | |||
| of samples taken from the skin microbiome of Dendrobates tinctorius, including taxonomy at various levels | |||
| Accession | Unique Markers (of 43) | Taxonomy (Contained) | Taxonomy (Sister Lineage) |
|---|---|---|---|
| 1Dt100g | 41 | k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales | f__Sphingomonadaceae→g__Sphingomonas |
| 1Dt100h | 42 | k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales | f__Dermabacteraceae |
| 1Dt1h | 42 | k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Brevibacteriaceae→g__Brevibacterium | s__Brevibacterium_linens |
| 1Dt2d | 43 | k__Bacteria→p__Proteobacteria→c__Gammaproteobacteria→o__Enterobacteriales→f__Enterobacteriaceae→g__Pantoea | s__ |
| 2Dt1e | 32 | k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Dermabacteraceae | g__Brachybacterium→s__Brachybacterium_faecium |
| 2Dt1l | 42 | k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Microbacteriaceae→g__Microbacterium | unresolved |
| 3Dt1c | 43 | k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales | f__Sphingomonadaceae→g__Sphingomonas |
| 3Dt2c | 43 | k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales | f__Sphingomonadaceae→g__Sphingomonas |
| 3Dt2h | 42 | k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Microbacteriaceae→g__Microbacterium | s__Microbacterium_testaceum |
| 3Dt2j | 43 | k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales | f__Sphingomonadaceae→g__Sphingomonas |
| Source: CheckM tree_qa performed on 2024-12-30 | |||
This is a subset of output 2 showing some taxonomic assignments for the samples, as well as “unique markers”. 9 are above 40, with only 2Dt1e falling below at 32. Output 1 is a simplified version of Output 2, with the same main fields but missing some found in 2. As mentioned, this is a subset, due to space, and some difficulties in coding, i only took the unique fields. The output table also contained data on number of genes and fragment count, GC etc that is already shown in Table 1: CheckM analysis. As well as this, there were some extra fields focussing on the mean values of each field for the groupings each sample were in, i beleive this is the final value in “Taxonomy (Contained)”, again because of space and relevance i cut them from the final product.
Figure 2: Phylogram for 4 samples known to be of the genus Sphingomonas. Generated by checkm tree_qa on the 30th of December 2024
Figure 3: Phylogram for 2 samples known to be of the genus Microbacterium. Generated by checkm tree_qa on the 30th of December 2024.
Figure 4: Phylogram for 1 sample known to be of the genus Brachybacterium. Generated by checkm tree_qa on the 30th of December 2024.
Figure 5: Phylogram for 1 sample known to be of the genus Brevibacterium. Generated by checkm tree_qa on the 30th of December 2024.
Figure 6: Phylogram for 1 sample of disputed genus. Generated by checkm tree_qa on the 30th of December 2024.
Outputs 3 and 4 produce the same tree files, with some minor differences in labeling, so i have only outputted the results from the one that returns the full taxonomic ID of each sample in the tip label. The trees look to be trustworthy, confirming previous taxonomic assignments i have run. It should be noted these use the IMG database, not the NCBI database, this looks to be a smaller database as many of our samples are outgroups on the trees with no direct sister samples, this can be seen most for Figure 2, with all 4 samples in the outgroup. This analysis does confirm again that sample 1Dt2d identified first as “Enterobacter cancerogenus” is in fact a species of the genus “Pantoea”. As mentioned above i am yet to figure out if i can do anything with the output from option 5, so i will discuss with Aaron when term time begins again.
Output 2, and to a lesser extent 1, present interesting data, it is not as useful as the data produced by the main analysis, but i think it was worth doing. It is possible i am not fully apreciating all of the outputs of this. Only very few of the samples get a taxonomy to the species level, however, it does confirm that sample 1Dt2d, marked “Enterobacter cancerogenus” by early taxonomic analysis actually belongs to the genus Pantoea. Confirming the analysis done all the way back in the technical spike for gtdbtk de novo Conclusion. This is mirrored in the outputs of the phylogenetic trees, which are good, but i doubt are a suitable replacement for the ones output by gtdbtk, which i will get onto at some point in the future. I have mentioned 5 several times, so i will just say that i might come back to look at it later. I learned some new things vis a vis new functions to tabulate like changing text size and pruning/collapsing trees in R, so i am less reliant on dendroscope, and have less tree files floating around. Overall, i think this has been a useful thing to do, even if the outputs are not as important or useful as what i have done or am about to do, even just for the experience alone, but who knows, maybe there is something in this data that will be useful to someone reading this.
📌 TODO: investigations into KO and protein stuff / just found out checkm is better documented and has some graph output options, so that is something to look into as well / tree_qa command (fun checkM stuff abound). / dendrogram? / did i ever link to the gt() documentation? / ggtree heatmaps?